Big Data project report

For course: I535 - MGMT ACCESS USE BIG DATA

Instructor: Inna Kouper

By Sesha Sai Krishna Valluri

Section 1: Introduction

In this project, I build a data pipeline that extracts public sentiment towards COVID-19 vaccines from Twitter and correlates it with the extent of vaccination in the country. To achieve this, I extract tweets about the COVID-19 vaccines using the snscrape package in Python, store them in a NoSQL database (MongoDB), analyse them in Python using PySpark, and visualise the data in Tableau. The pipeline has seven main steps:

1. Extracting tweets from Twitter
    - Fields used: tweet content, tweet date, retweet count, tweet language, user location, user followers, and user verified status
2. Storing these tweets in MongoDB using the Python MongoDB connector
3. Connecting to MongoDB from Python
4. Analysing (pre-processing and transforming) the tweets in Python using PySpark and generating wordclouds from the text data
5. Exporting the summarised data to Google Drive using an API
6. Importing the data from Google Drive, and
7. Visualising the data in Tableau

Note: The entire pipeline is implemented on a Jetstream VM.

1*8-NNHZhRVb5EPHK5iin92Q.png (source: Google Images)

Section 2: Background

I chose this topic because most of us have been directly or indirectly affected by COVID-19. A few reasons for the severe spread of the virus have been a) unavailability of vaccines, b) reluctance to get vaccinated, c) improper masking techniques, and d) ineffective border control. I focus on b), the reluctance of people to get vaccinated, by studying the sentiment people express towards various vaccines on the social media platform Twitter.

Although the COVID-19 vaccines have proven quite effective against the virus (efficacy rates of ~90-95% in clinical trials), there has been a lot of misunderstanding, scepticism and stigma among the public around these vaccines because of the spread of misinformation. This has resulted in lower vaccination rates in some countries, causing increased spread of the virus. So, it is extremely important to understand public sentiment towards the vaccines and to create awareness in order to contain the spread of the virus.

In this project, I primarily analyse the sentiment towards different vaccines to see which vaccines people trust the most, and how the sentiment has evolved alongside vaccination rates, to understand whether vaccinations lead to positive sentiment.

Section 3: Methodology

3.1 Setting up VM and MongoDB environments

3.1.1 MongoDB setup and configuration

I used MongoDB to store the tweets and the world vaccination data. I created a free shared cluster in the Iowa region with 3 replica set members and a storage limit of 512 MB.

MongoDB is a resilient distributed storage system in which the data is replicated across the members of a replica set (1 primary and 2 secondaries in this case).

1.png

2.png

6.png

3.1.2 Jetstream VM initialization

A PySpark VM instance is launched on Jetstream for this analysis. Necessary packages such as Jupyter Notebook and Py4J are installed.

Specs: m1.quad (4 CPUs, 10 GB memory and 20 GB disk)

Jetstream VM.png

3.2 Data Sources

3.2.1 Extracting tweets from twitter

1. The tweets for each vaccine are extracted from Twitter using snscrape in the form of JSON objects. These tweets are stored in a local folder and are later written to MongoDB.
2. For the 'Moderna' keyword, tweets are extracted between 1/1/2022 and 4/30/2022, up to a maximum of the 25,000 latest tweets. Tweets for the Covaxin and Pfizer vaccines are extracted in the same way.
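
The extraction described in the two steps above could look roughly like the following sketch. It assumes snscrape's Python module interface (`snscrape.modules.twitter.TwitterSearchScraper`) and Twitter's `since:`/`until:` search operators. The helper names `build_query`, `tweet_to_record`, `scrape_tweets` and `save_tweets` are hypothetical, as are the output field names, and the attribute names on the tweet objects may differ between snscrape versions.

```python
import itertools
import json


def build_query(keyword, since, until):
    """Build a Twitter search query limited to a date range (YYYY-MM-DD)."""
    return f"{keyword} since:{since} until:{until}"


def tweet_to_record(tweet):
    """Keep only the fields used in this project (attribute names are assumed)."""
    user = tweet.user
    # Newer snscrape versions expose rawContent, older ones expose content.
    content = getattr(tweet, "rawContent", None) or getattr(tweet, "content", None)
    return {
        "content": content,
        "date": tweet.date.isoformat(),
        "retweet_count": tweet.retweetCount,
        "language": tweet.lang,
        "user_location": user.location,
        "user_followers": user.followersCount,
        "user_verified": user.verified,
    }


def scrape_tweets(keyword, since, until, max_tweets=25000):
    """Scrape up to max_tweets tweets; needs snscrape and network access."""
    import snscrape.modules.twitter as sntwitter  # imported lazily on purpose

    scraper = sntwitter.TwitterSearchScraper(build_query(keyword, since, until))
    items = itertools.islice(scraper.get_items(), max_tweets)
    return [tweet_to_record(tweet) for tweet in items]


def save_tweets(records, path):
    """Write the records to a local JSON file for the later MongoDB import."""
    with open(path, "w") as f:
        json.dump(records, f)
```

A call such as `save_tweets(scrape_tweets("Moderna", "2022-01-01", "2022-04-30"), "moderna_tweets.json")` would then produce one file per vaccine keyword.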

3.2.2 Vaccination statistics data

The world vaccination data is downloaded from https://ourworldindata.org as a CSV file. It contains the daily vaccination statistics for each country.
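
As an illustration, the daily statistics could be loaded with Python's standard `csv` module as sketched below. The column names `location`, `date` and `daily_vaccinations` are assumed to match the downloaded file, and the function name and the sample rows are invented for the example.

```python
import csv
import io
from collections import defaultdict


def load_daily_vaccinations(csv_file):
    """Return {country: {date: daily_vaccinations}} from an open CSV file."""
    stats = defaultdict(dict)
    for row in csv.DictReader(csv_file):
        value = row.get("daily_vaccinations")
        if not value:
            continue  # skip days with no reported figure
        stats[row["location"]][row["date"]] = float(value)
    return dict(stats)


# Made-up sample rows, only to show the expected shape of the input.
sample = io.StringIO(
    "location,date,daily_vaccinations\n"
    "India,2022-03-29,2500000\n"
    "India,2022-03-30,\n"
    "United States,2022-03-29,300000\n"
)
vaccinations = load_daily_vaccinations(sample)
print(vaccinations["India"])
```

For the real file, the same function can be called on `open(path, newline="")` instead of the in-memory sample.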

3.3.1 Writing tweet data to MongoDB

The extracted tweet JSON data is written to the Covid_tweets database in MongoDB using MongoClient.

    import json
    from pymongo import MongoClient

    # Connect to the Atlas cluster (credentials replaced with placeholders)
    myclient = MongoClient("mongodb+srv://<username>:<password>@cluster0.lf8mp.mongodb.net/Covid_tweets?retryWrites=true&w=majority")
    db = myclient["Covid_tweets"]
    collection = db["moderna"]

    # Load the scraped tweets and insert them in a single batch
    with open('/home/svallur/moderna_tweets.json') as file:
        file_data = json.load(file)
    collection.insert_many(file_data)

The image below shows the Covid_tweets database after all the relevant data is inserted.

3.png

4.png

3.3.2 Initialising PySpark
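
A Spark session for this kind of analysis is typically created once at the top of the notebook. The sketch below assumes a local session on the VM; the helper names, the application name and the memory setting are illustrative values, not the ones used in the project.

```python
def spark_settings(app_name="covid_vaccine_sentiment", driver_memory="4g"):
    """Collect the Spark configuration in one place (values are assumptions)."""
    return {
        "spark.app.name": app_name,
        "spark.driver.memory": driver_memory,
    }


def create_spark_session(settings):
    """Create (or reuse) a local SparkSession; needs pyspark installed."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    builder = SparkSession.builder.master("local[*]")
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

In a notebook this would be used as `spark = create_spark_session(spark_settings())`.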

3.4 Initialising MongoDB connection
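
Reading the stored tweets back into Python could be done with pymongo as in the sketch below. The environment variable `MONGODB_URI`, the helper names and the field names are hypothetical; `find` with a projection document is standard pymongo.

```python
import os

# Field names are assumptions based on the fields listed in the introduction.
TWEET_FIELDS = [
    "content", "date", "retweet_count", "language",
    "user_location", "user_followers", "user_verified",
]


def get_database(name="Covid_tweets"):
    """Connect using a URI kept outside the notebook; needs pymongo installed."""
    from pymongo import MongoClient  # imported lazily on purpose

    return MongoClient(os.environ["MONGODB_URI"])[name]


def build_projection(fields):
    """Projection that returns only the given fields and hides _id."""
    projection = {field: 1 for field in fields}
    projection["_id"] = 0
    return projection


def fetch_tweets(collection, fields=TWEET_FIELDS):
    """Return all documents of a collection as a list of plain dicts."""
    return list(collection.find({}, build_projection(fields)))
```

For example, `fetch_tweets(get_database()["moderna"])` would return the Moderna tweets with only the listed fields.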

3.4.1 Reading and pre-processing vaccines data

The vaccines data is read into Python; only the relevant columns are kept and the rest of the columns in the original data are dropped. The same steps are repeated for all three vaccines.

These datasets are converted to Spark dataframes in the subsequent steps.

3.4.2 All three vaccine datasets are concatenated into a master dataset
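
With Spark dataframes, this concatenation is commonly expressed as a chain of `unionByName` calls. The following is a minimal sketch under that assumption; the helper names and the `vaccine` column are hypothetical.

```python
from functools import reduce


def tag_vaccine(df, vaccine):
    """Add a column naming the vaccine so rows stay distinguishable after the union."""
    from pyspark.sql.functions import lit  # imported lazily on purpose

    return df.withColumn("vaccine", lit(vaccine))


def concat_frames(frames):
    """Union dataframes by column name, left to right."""
    return reduce(lambda left, right: left.unionByName(right), frames)
```

Usage would be along the lines of `master = concat_frames([tag_vaccine(moderna_df, "Moderna"), tag_vaccine(pfizer_df, "Pfizer"), tag_vaccine(covaxin_df, "Covaxin")])`, where the three dataframe names are placeholders.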

3.5 Building NLP model

3.5.1 Sentiment is extracted using TextBlob

The tweets are cleaned to remove punctuation marks and then passed through TextBlob to get the sentiment. This new information is added to the dataframe.
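
A minimal sketch of this step is shown below. It assumes TextBlob's `sentiment.polarity` score (a value between -1 and 1) is mapped to a label with a threshold of zero, and that the function is applied to the Spark dataframe through a UDF; the helper names, the threshold and the label names are assumptions.

```python
import re


def clean_tweet(text):
    """Replace punctuation with spaces and collapse repeated whitespace."""
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def label_polarity(polarity, threshold=0.0):
    """Map a polarity score to a sentiment label (threshold is an assumption)."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


def tweet_sentiment(text):
    """Clean a tweet and label its TextBlob polarity; needs textblob installed."""
    from textblob import TextBlob  # imported lazily on purpose

    return label_polarity(TextBlob(clean_tweet(text)).sentiment.polarity)


def make_sentiment_udf():
    """Wrap tweet_sentiment as a PySpark UDF; needs pyspark installed."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    return udf(tweet_sentiment, StringType())
```

The UDF could then be applied with something like `df.withColumn("sentiment", make_sentiment_udf()(df["content"]))`, where `content` is the assumed name of the tweet text column.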

3.6 Google sheet integration

The Google Sheets API is used to write the summarised data to Google Sheets.
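
One way to do this from Python is the gspread package with a service account. The sketch below assumes that approach; the helper names, the sheet title and the credentials path are placeholders.

```python
def rows_to_values(rows, columns):
    """Turn a list of dicts into a header row followed by data rows."""
    header = list(columns)
    body = [[row.get(column, "") for column in columns] for row in rows]
    return [header] + body


def write_summary(rows, columns, sheet_title, credentials_path):
    """Overwrite the first worksheet with the summary; needs gspread installed."""
    import gspread  # imported lazily on purpose

    client = gspread.service_account(filename=credentials_path)
    worksheet = client.open(sheet_title).sheet1
    worksheet.clear()
    worksheet.append_rows(rows_to_values(rows, columns))
```

Writing the header and the data in one `append_rows` call keeps the number of API requests small, which matters because the Sheets API is rate limited.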

Google%20sheets.png

3.7 Google drive API integration

The data is also written to Google Drive as a CSV file to ensure that large data files are transferred properly.
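
A sketch of such an upload is given below, assuming the Google API Python client (`googleapiclient`) with a service account and the Drive v3 `files().create` call. The helper names, file names and credentials path are placeholders.

```python
import csv


def write_csv(rows, columns, path):
    """Write a list of dicts to a local CSV file and return its path."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)
    return path


def upload_csv(path, credentials_path, remote_name):
    """Upload a CSV file to Google Drive; needs google-api-python-client and google-auth."""
    from google.oauth2.service_account import Credentials
    from googleapiclient.discovery import build
    from googleapiclient.http import MediaFileUpload

    creds = Credentials.from_service_account_file(
        credentials_path,
        scopes=["https://www.googleapis.com/auth/drive.file"],
    )
    service = build("drive", "v3", credentials=creds)
    # A resumable upload is sent in chunks, which suits larger files.
    media = MediaFileUpload(path, mimetype="text/csv", resumable=True)
    created = service.files().create(
        body={"name": remote_name}, media_body=media, fields="id"
    ).execute()
    return created["id"]
```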

Screen%20Shot%202022-05-01%20at%206.47.56%20PM.png

3.8 Pre-processing and transforming data for building wordclouds

Customised wordclouds are built to visualise the frequent words that appear in tweets for each of the vaccines.
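
The pre-processing for a wordcloud boils down to counting words after removing stopwords. The sketch below illustrates that idea with the standard library and hands the counts to the `wordcloud` package; the stopword list, the minimum word length, the helper names and the image size are assumptions.

```python
import re
from collections import Counter

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "and", "is", "to", "of", "in", "for", "it", "this"}


def word_frequencies(tweets, extra_stopwords=()):
    """Count lower-cased words across tweets, skipping stopwords and short words."""
    stop = STOPWORDS | set(extra_stopwords)
    counts = Counter()
    for tweet in tweets:
        for word in re.findall(r"[a-z']+", tweet.lower()):
            if word not in stop and len(word) > 2:
                counts[word] += 1
    return counts


def build_wordcloud(frequencies, path):
    """Render the frequencies as an image; needs the wordcloud package installed."""
    from wordcloud import WordCloud  # imported lazily on purpose

    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(path)
```

Passing the vaccine name itself through `extra_stopwords` keeps the search keyword from dominating each cloud.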

Section 4: Results

There are two main outputs in this project:

a. Wordclouds - looking at what people are talking about
b. Tableau dashboards - visualising the results from Python and the vaccination statistics

4.1 Visualising Wordclouds

4.2 Visualisation in Tableau

The summarised data uploaded to Google Sheets and Google Drive is imported into a Tableau dashboard through a live connection, and dynamic visuals are built on it.

Overall%20view.png

Covaxin%20view.png

Vaccination%20stats.png

Section 5: Discussion

Below are a few key insights from analysing the latest 25,000 tweets for the Covaxin, Moderna and Pfizer vaccines:

1. Among these vaccines, Covaxin seems to have the best sentiment among people, followed by Pfizer, with Moderna lagging slightly behind Pfizer.
2. Tweets from verified profiles seem to have a more positive sentiment than those from non-verified profiles. This could be because misinformation is mainly spread from fake or unverified accounts.
3. Covaxin appears most frequently in tweets from users with high follower counts. This could be mainly driven by popular people from India who have a massive following on social media platforms.
4. The recent improvement in sentiment seems to be correlated with the recent increase in vaccinations.
    For example, there is a surge in public sentiment for Covaxin on March 30, which could be driven by increased vaccinations in India from March 23 to March 30.
5. The wordcloud for negative tweets about the Pfizer vaccine shows words like death, sick and bad, which could be due to the misinformation spread on the platform.
6. The wordcloud for negative tweets about the Moderna vaccine shows words like kids, which could be due to scepticism about vaccine efficacy for children and Pfizer being the preferred vaccine for children.
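
One way to put a number on the relationship suggested in point 4 is the Pearson correlation coefficient between a daily average sentiment series and the daily vaccination counts for the same dates. The following is a minimal sketch under that assumption; the function name is illustrative and the inputs are two equally long lists of numbers.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 would support the observation, while a value near 0 would suggest that the apparent link is weak; with only a few days of data, either result should be read with caution.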

Section 6: Conclusion

From this project, I noticed that the extent of vaccinations and public sentiment are quite likely correlated. However, the insights generated in this project may not be statistically significant, given the narrow time period considered owing to a few constraints. The correlation could be strong because, as more and more people get vaccinated, the spread of the virus gradually decreases and people become more confident in the efficacy of the vaccines. This in turn motivates more people to get vaccinated and leads to better sentiment among the masses.

There is an enormous amount of data available on the internet these days, and it is difficult for the general public to parse through misinformation and spam to get accurate information.

Next steps: In the future, I would like to extend the scope of the project to a broader time period to look at the evolution of sentiment over the last couple of years. I would also like to include other vaccines, such as Johnson & Johnson and Sputnik, in the analysis for a better understanding of public sentiment.

Section 7: References

Below are a few references used for this project:

1. https://www.mongodb.com/blog/post/getting-started-with-python-and-mongodb
2. https://sparkbyexamples.com/pyspark/pyspark-udf-user-defined-function/
3. https://medium.com/@aieeshashafique/exploratory-data-analysis-using-pyspark-dataframe-in-python-bd55c02a2852
4. https://www.techwithtim.net/tutorials/google-sheets-python-api-tutorial/
5. https://help.tableau.com/current/pro/desktop/en-us/data_explore_analyze_interact.htm
6. https://towardsdatascience.com/sentiment-analysis-of-covid-19-vaccine-tweets-dc6f41a5e1af